0.1 Introduction

In response to the growing demand by Zillow for a housing market model that incorporates local context, we have been tasked with leveraging data from multiple sources, including Philadelphia Open Data, the U.S. Census Bureau, and the American Community Survey, to gather the comprehensive information needed to develop an accurate and reliable model of the Philadelphia housing market.

This report outlines our approach and methods, and the importance of understanding the local context we integrate with pre-existing data. We aim to provide greater context and a more precise understanding of Philadelphia’s housing scene for present and future use.

1 Get Data

To develop a deeper understanding of the housing market, we acquired data that looks beyond the house itself to the conditions and amenities that surround it, capturing the deeper meaning behind “location”. Our first source was the U.S. Census Bureau, a starting point for geographic features for visual display, broad context on parcels and neighborhoods, and overall socioeconomic data.

The next source of data, built on the U.S. Census, was Social Explorer. This website repackages U.S. Census data into a more user-friendly and comprehensive database, allowing us to efficiently find neighborhood-level information to experiment with in our model. It fostered our ideas about potential indicators of sale price beyond the house itself, namely the amenities surrounding it, and led us to investigate less conventional neighborhood attributes that can be summarized as quantitative measures. Taking all of these variables into account, we were able to predict the prices of homes throughout Philadelphia with substantial accuracy.

1.1 Census Data

We download the following data from the Census: Population Estimate, Total Housing Units, Total Vacant Units, Median Household Income, Households with Public Assisted Income, Owner-Occupied Housing Units, White Homeowners, and Total Occupied Housing Units. These variables provide a basic overview of housing in Philadelphia but are not sufficient on their own to predict future sale prices.
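These variables can be pulled at the tract level with the tidycensus package. The sketch below shows the general pattern; the ACS table codes (e.g. B01003_001) are illustrative assumptions for three of the variables, not necessarily the exact codes we used:

```r
#Sketch of the ACS download; variable codes shown are illustrative assumptions
library(tidycensus)
library(sf)
library(dplyr)

acsTractsPHL.2021 <- get_acs(geography = "tract",
                             variables = c(totalPop = "B01003_001",  #population estimate
                                           totalHU  = "B25001_001",  #total housing units
                                           medHHInc = "B19013_001"), #median household income
                             year = 2021, state = "PA", county = "Philadelphia",
                             geometry = TRUE, output = "wide") %>%
  st_transform('EPSG:2272')
#Note: in wide output, get_acs appends "E" (estimate) and "M" (margin of error)
#suffixes to the column names, e.g. totalPopE, which would then be renamed.
```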

#Store your Census API key in an environment variable rather than publishing it
census_api_key(Sys.getenv("CENSUS_API_KEY"), overwrite = TRUE)

1.2 Get Data from Open Data Philly

OpenDataPhilly is an open data catalog of free datasets, officially sponsored by the City of Philadelphia. The data provided by the City and other organizations allows us to collect a wide variety of information we can use to categorize, describe, and develop a profile for each neighborhood. We used this website to gather planning districts, redevelopment areas, and restaurant locations.

planning_districts <- st_read("https://opendata.arcgis.com/datasets/0960ea0f38f44146bb562f2b212075aa_0.geojson")%>%
  st_transform('EPSG:2272')
## Reading layer `Planning_Districts' from data source 
##   `https://opendata.arcgis.com/datasets/0960ea0f38f44146bb562f2b212075aa_0.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 18 features and 8 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -75.28031 ymin: 39.86747 xmax: -74.95575 ymax: 40.13793
## Geodetic CRS:  WGS 84
redevelopment_areas <- st_read("https://data-phl.opendata.arcgis.com/datasets/80f2c71305f5493c8e0aab9137354844_0.geojson") %>%
  dplyr::filter(TYPE == 'Redevelopment Area Plan and Blight Certification' & STATUS == 'Active') %>%
  st_transform('EPSG:2272')
## Reading layer `Redevelopment_Certified_Areas' from data source 
##   `https://data-phl.opendata.arcgis.com/datasets/80f2c71305f5493c8e0aab9137354844_0.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 71 features and 13 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -75.26513 ymin: 39.87596 xmax: -75.01541 ymax: 40.08466
## Geodetic CRS:  WGS 84
restaurants <- st_read('https://opendata.arcgis.com/datasets/53b8a1c653a74c92b2de23a5d7bf04a0_0.geojson')%>%
  st_transform('EPSG:2272') %>%
  dplyr::select(GEOID10,TOTAL_RESTAURANTS)
## Reading layer `370eb703-5a4f-4abb-8920-727cef31373b2020329-1-rcimn.5n39o' from data source `https://opendata.arcgis.com/datasets/53b8a1c653a74c92b2de23a5d7bf04a0_0.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 1336 features and 16 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -75.28031 ymin: 39.86747 xmax: -74.95575 ymax: 40.13793
## Geodetic CRS:  WGS 84
#Normalize restaurant counts by area to get restaurants per square mile
restaurants <- restaurants %>%
  mutate(rest_per_sqmi = as.numeric(TOTAL_RESTAURANTS / (st_area(restaurants)/358700000)))

1.3 Load House Sales Data from Github

The data here is the sample of house sales we were provided before building our model. Its attributes form a base model of price indicators that focus exclusively on the house’s own features: the data above describes external factors, while this data describes internal ones. The attributes we use from the house data need to be quantifiable and relevant, so several ranked or categorical fields were manually recoded into values the model can calculate with. From the many factors available, the table is then filtered down to only the most important indicators.

house_data <- st_read("https://raw.githubusercontent.com/mafichman/musa_5080_2023/main/Midterm/data/2023/studentData.geojson") %>%
  dplyr::select("objectid","central_air","exterior_condition", "fireplaces", "garage_spaces", "interior_condition", "mailing_street", "location", "number_of_bathrooms", "number_of_bedrooms","number_of_rooms","number_stories","sale_price","sale_date","total_area","total_livable_area","year_built","toPredict") %>%
  st_transform('EPSG:2272')
## Reading layer `studentData' from data source 
##   `https://raw.githubusercontent.com/mafichman/musa_5080_2023/main/Midterm/data/2023/studentData.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 23881 features and 75 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -75.27367 ymin: 39.89162 xmax: -74.9618 ymax: 40.1377
## Geodetic CRS:  WGS 84
#Do some feature engineering
house_data <- house_data %>%
  mutate(
    #Recode exterior condition so that four is the best condition, grouping some categories together
    exterior_condition = as.numeric(recode(exterior_condition, '7'='1', '6'='2', '5'='3', '4'='3', '3'='4', '2'='4', '1'='4')), 
    #Assume average condition if the value is NA
    exterior_condition = ifelse(is.na(exterior_condition),2,exterior_condition),
    #Set blank values to 0; based on the metadata, a blank value means no fireplaces.
    fireplaces = ifelse(is.na(fireplaces),0,fireplaces),
    #Assume there is always at least one bathroom - if a property has 0 (or NA) bathrooms, assign a value of 1.
    number_of_bathrooms = ifelse(number_of_bathrooms == 0,1,number_of_bathrooms),
    number_of_bathrooms = ifelse(is.na(number_of_bathrooms),1,number_of_bathrooms),
    #Assume average condition if the value is NA
    interior_condition = ifelse(is.na(interior_condition),4,interior_condition))

1.4 Load School Test Score Data

We look at the percentage of students at each public school who earned proficient or advanced scores on the PSSA/Keystone exams, which cover Science, Math, and English Language Arts. For each school, we averaged students’ performance across the three sections to get a single score. We then determined the nearest school to each home and assigned that school’s score to the home.
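A minimal sketch of this step, assuming the schools layer carries per-subject percent-proficient columns (the names pct_prof_math, pct_prof_science, and pct_prof_ela are hypothetical):

```r
#Hypothetical column names; the averaging and nearest-feature assignment mirror the text above
schools <- schools %>%
  mutate(mean_profadv = (pct_prof_math + pct_prof_science + pct_prof_ela) / 3)

#st_nearest_feature returns, for each home, the row index of its nearest school
nearest_school <- st_nearest_feature(house_data, schools)
house_data$mean_profadv <- schools$mean_profadv[nearest_school]
```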

1.5 Get Data on number of Trees from Open Data Philly

We download data on individual trees as points and calculate the number of trees per square mile in each census tract. Trees indicate aesthetic appeal, environmental benefits, and shade. We extracted this data from OpenDataPhilly and joined it to the census tracts.

#Get Data on trees in Philadelphia
trees <- st_read('https://opendata.arcgis.com/api/v3/datasets/5abe042f2927486891c049cf064338cb_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1')%>%
  st_transform('EPSG:2272')
## Reading layer `PPR_Tree_Inventory_2022' from data source 
##   `https://opendata.arcgis.com/api/v3/datasets/5abe042f2927486891c049cf064338cb_0/downloads/data?format=geojson&spatialRefId=4326&where=1%3D1' 
##   using driver `GeoJSON'
## Simple feature collection with 151164 features and 6 fields
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -75.28272 ymin: 39.87498 xmax: -74.95878 ymax: 40.13746
## Geodetic CRS:  WGS 84
#Calculate the number of trees per census tract and convert to trees per square mile to normalize
acsTractsPHL.2021 <- st_intersection(acsTractsPHL.2021, trees) %>% #Determine what census tract each tree is in using intersection
  st_drop_geometry() %>%
  group_by(GEOID) %>% summarize(tree_count = n()) %>% #Count the number of trees in each Census tract
  right_join(acsTractsPHL.2021, by='GEOID') %>% #Join tree count to the census tract dataframe
  st_sf() %>%
  mutate(trees_per_mi = as.double(tree_count / (st_area(acsTractsPHL.2021)/358700000)))
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries

1.6 Urban Corridors

We calculate the distance from each house to the nearest urban corridor. For most houses, proximity to an urban corridor means access to amenities, lower transportation costs, and a higher overall quality of life, so for urban Philadelphia generally, being close to a corridor is an amenity that raises home values. Suburban areas on the outskirts of the city are the exception: these car-dependent neighborhoods prefer to be away from corridors, valuing quiet, privacy, and distance.

corridors_url <- "https://opendata.arcgis.com/datasets/f43e5f92d34e41249e7a11f269792d11_0.geojson"
corridors <- st_read(corridors_url, quiet= TRUE) %>% st_transform('EPSG:2272')

nearest_fts <- st_nearest_feature(house_data,corridors)

house_data$dist_urb_corridor <- as.double(st_distance(house_data, corridors[nearest_fts,], by_element=TRUE))

1.7 Shootings

We calculate the number of shootings within a 1/2-mile and a 1/4-mile radius of each home. Shootings can depress home values through their impact on the perception of crime in the area and buyers’ concerns about personal safety and property damage. This data was sourced directly from Carto, another open data platform.

shootings_url <- "https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+shootings&filename=shootings&format=geojson&skipfields=cartodb_id"
shootings <- st_read(shootings_url) %>% st_transform('EPSG:2272') %>%
  mutate(date_=as.Date(date_, format = "%d-%m-%Y"))  %>%
  dplyr::filter(date_ > '2023-01-01') %>%
  dplyr::select(location)
## Reading layer `OGRGeoJSON' from data source 
##   `https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+shootings&filename=shootings&format=geojson&skipfields=cartodb_id' 
##   using driver `GeoJSON'
## replacing null geometries with empty geometries
## Simple feature collection with 15057 features and 21 fields (with 437 geometries empty)
## Geometry type: POINT
## Dimension:     XY
## Bounding box:  xmin: -75.27362 ymin: 39.88409 xmax: -74.95936 ymax: 40.13117
## Geodetic CRS:  WGS 84
house_data <- st_intersection(shootings,st_buffer(house_data,2640)) %>%
  st_drop_geometry() %>%
  count(objectid) %>%
  rename(shooting_halfmile = n) %>%
  right_join(house_data, by='objectid') %>%
  mutate(shooting_halfmile = ifelse(is.na(shooting_halfmile),0,shooting_halfmile)) %>%
  st_sf()
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
house_data <- st_intersection(shootings,st_buffer(house_data,2640/2)) %>%
  st_drop_geometry() %>%
  count(objectid) %>%
  rename(shooting_quartermile = n) %>%
  right_join(house_data, by='objectid') %>%
  mutate(shooting_quartermile = ifelse(is.na(shooting_quartermile),0,shooting_quartermile)) %>%
  st_sf()
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries

2 Join All Data to House Sales Data

After gathering data from these various sources, we now coalesce them into one final dataset for our prediction model.

HDJoin <- st_intersection(house_data, restaurants %>%
                               dplyr::select("TOTAL_RESTAURANTS", "rest_per_sqmi"))
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
HDJoin <- st_intersection(HDJoin, planning_districts %>%
                             dplyr::select("DIST_NAME"))
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
HDJoin <- st_join(HDJoin,schools %>% dplyr::select('mean_profadv'), join=st_nearest_feature)

HDJoinRDAs <- HDJoin[st_intersects(HDJoin, redevelopment_areas) %>% lengths > 0, ] %>% 
  mutate(DevelopmentArea = 1)
  
NotRDAs <- HDJoin[!st_intersects(HDJoin, redevelopment_areas) %>% lengths > 0, ] %>% 
  mutate(DevelopmentArea = 0)

HDBind <- rbind(NotRDAs, HDJoinRDAs)

HDFinal <- st_intersection(HDBind, acsTractsPHL.2021 %>%
                             dplyr::select(-NAME)) %>%
                             dplyr::filter(toPredict == "MODELLING")
## Warning: attribute variables are assumed to be spatially constant throughout
## all geometries
C <- ggplot()+
  geom_sf(data=restaurants,aes(fill = q5(rest_per_sqmi)))

D <- ggplot()+
  geom_sf(data=restaurants,aes(fill = q5(TOTAL_RESTAURANTS)))

3 Summary Statistics

The table of summary statistics gives us a baseline on the range of values and distribution of each variable in our dataset. With all these factors in hand, we can use the variables collectively to predict home prices and visualize how strongly specific variables indicate price.

HDFinal_nongeom <- HDFinal %>% st_drop_geometry()
HDFinal_nongeom <- HDFinal_nongeom %>%
  dplyr::select_if(is.numeric) %>%
  dplyr::select(-objectid, -totalPop, -year_built, -totalHU, -number_of_rooms, -TotalOccH)

stargazer(HDFinal_nongeom, type = 'text', title= "Table 1: Summary Statistics")
## 
## Table 1: Summary Statistics
## ==========================================================================
## Statistic              N       Mean      St. Dev.      Min         Max    
## --------------------------------------------------------------------------
## shooting_quartermile 23,779    3.764       5.461        0          45     
## shooting_halfmile    23,779   13.644      15.507        0          99     
## exterior_condition   23,779    3.168       0.460        1           4     
## fireplaces           23,779    0.048       0.289        0           5     
## garage_spaces        23,358    0.357       0.706        0          72     
## interior_condition   23,779    3.647       0.889        0           7     
## number_of_bathrooms  23,779    1.215       0.517        1           8     
## number_of_bedrooms   23,763    2.583       1.267        0          31     
## number_stories       23,771    1.958       0.551        1           5     
## sale_price           23,779 272,984.900 239,942.500   10,100    5,990,000 
## total_area           23,705  1,828.545   3,823.931      0        226,512  
## total_livable_area   23,779  1,333.585    549.196       0        15,246   
## dist_urb_corridor    23,779   670.238     638.845     0.000     7,383.925 
## TOTAL_RESTAURANTS    23,779    4.333       7.314        0          174    
## rest_per_sqmi        23,779  1,040.975   1,783.240    0.000    37,464.860 
## mean_profadv         23,779   28.770      17.352      4.285      92.333   
## DevelopmentArea      23,779    0.156       0.363        0           1     
## tree_count           23,779   379.244     246.684       40        1,832   
## totalVacant          23,779   202.342     136.300       0          691    
## medHHInc             23,779 59,797.020  28,650.770  11,955.000 210,322.000
## HHAssistedInc        23,779   513.927     358.376       0         2,048   
## OwnerOccH            23,779  1,094.438    475.065       0         2,685   
## WhiteHomeowner       23,779   880.528     698.197       0         2,706   
## HHOccupiedRate       23,779    0.514       0.159      0.000       0.877   
## WhiteHOrate          23,779    0.450       0.316      0.000       0.970   
## AssistedIncRate      23,779    0.243       0.154      0.000       0.681   
## trees_per_mi         23,779 27,030.210  28,708.110  1,069.279  215,615.900
## --------------------------------------------------------------------------

4 Correlation Matrix

A correlation matrix compares variables pairwise to show how strongly each is related to the others. Each cell indicates whether two variables are directly correlated, inversely correlated, or largely unrelated. We see strong correlations between many variables. We are interested in every variable that shows a meaningful relationship with sale price, specifically those with an absolute correlation of 0.3 or greater.

HDFinal_nongeom %>% 
  correlate() %>% 
  autoplot() +
  geom_text(aes(label = round(r, digits = 2)), size = 3.5)
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'

# Predictor Variable Correlations

The four graphs below represent four key predictor variables that broadly cover the aspects of a neighborhood used to determine house prices. Assisted income reflects socioeconomic status, producing an inverse relationship with price. Tree count and total livable area are directly correlated with a house’s desirability, because home size is considered a luxury and more trees often indicate better care for a neighborhood. The last variable we look at is the white homeowner rate. White homeownership can indicate a “perceived” higher price value because of racial segregation, wealth inequality, and potential gentrification.

## Variables: total_livable_area, tree_count, WhiteHOrate, AssistedIncRate
HDFinal_nongeom %>%
  dplyr::select(sale_price, total_livable_area, tree_count, WhiteHOrate, AssistedIncRate) %>%
  filter(sale_price <= 1000000) %>%
  gather(Variable, Value, -sale_price) %>% 
   ggplot(aes(Value, sale_price)) +
     geom_point(size = .5) + geom_smooth(method = "lm", se=F, colour = "#FA7800") +
     facet_wrap(~Variable, nrow = 1, scales = "free") +
     labs(title = "Price as a function of continuous variables") +
     plotTheme()
## Warning: The `size` argument of `element_rect()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
## ℹ Please use the `linewidth` argument instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `geom_smooth()` using formula = 'y ~ x'

5 Map of Home Prices in Philadelphia

This map explores the variation in 2021 home prices across Philadelphia. We noticed the most expensive areas are located near Center City or on the far outskirts, such as Chestnut Hill. This produces a ring-like pattern: the most expensive areas are either in the core, where all the activity is, or furthest away, where there is the least interaction with the city.

HDFinal <- HDFinal %>% 
  mutate(sale_class = cut(sale_price, breaks = c(0, 250000, 500000, 750000, 1000000, max(HDFinal$sale_price, na.rm=TRUE))))

ggplot()+
    geom_sf(data=planning_districts,fill='grey80',color='transparent')+
    geom_sf(data=HDFinal, size=0.75,aes(colour = q5(sale_class)))+
    geom_sf(data=planning_districts,fill='transparent',color='black')+
  scale_color_manual(values = palette5,
                    name = "Sales Price (USD)",
                    na.value = 'grey80',
                    labels = c('$0-$250k', '$250k-$500k', '$500k-$750k', '$750k-$1m', '$1m+'))+
  mapTheme()

# Interesting Map 1, Assisted Income

This map shows the rate of households relying on assisted income across the city. We can observe that the central northeast, in areas such as Kensington, Fairhill, and Feltonville, is most reliant on government assistance for income. The surrounding areas may also be affected by the poverty concentrated there, forming a perimeter around the “yellow” core on the map.

ggplot()+
  geom_sf(data=planning_districts,fill='grey80',color='transparent')+
  geom_sf(data=HDFinal,aes(colour = (AssistedIncRate * 100)),size=0.5)+
  scale_color_viridis(name = "Assisted Income Per 100 Households")+
  geom_sf(data=planning_districts,fill='transparent',color='black')+
mapTheme()

6 Interesting Map 2, Student Performance by School

The school system in a neighborhood can be a deciding factor for families seeking a home in an area. The Keystone Exam is the state’s index for evaluating school performance and the quality of students’ education.

#mean_profadv is the percentage of students who received proficient OR advanced scores on the Keystone Exam, per the PA Department of Education
ggplot()+
  geom_sf(data=planning_districts,fill='grey80',color='transparent')+
  geom_sf(data=schools,aes(colour = q5(mean_profadv)),size=2.5)+
  scale_color_viridis_d(name = "Student Performance on Keystone Exam", labels = c('Poor', 'Below Average', 'Average', 'Above Average', 'Excellent'))+
  geom_sf(data=planning_districts,fill='transparent',color='black')+
mapTheme()

# Interesting Map 3, Heat Map of Shootings

We plotted where all shootings have occurred and created a heat map that shows the “hot spots” where shootings most often occur. As with the education ratings and government assistance maps, we notice crime is clustered just north of the city center, with residual spillover into the areas north of that cluster. Center City and the areas south of the largest hot spot appear spared from gun violence.

ggplot()+
  geom_sf(data=planning_districts,fill='grey80',color='transparent')+
  stat_density2d(data = data.frame(st_coordinates(shootings)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') +
  scale_fill_gradient(low = "white", high = "red", name = "Shootings")+
  scale_alpha(guide = "none") +
  labs(title = "Density of Reported Shootings") +
  geom_sf(data=planning_districts,fill='transparent',color='black')+
  mapTheme()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..level..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(level)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: Removed 15 rows containing non-finite values (`stat_density2d()`).

# Home Prices and Redevelopment Areas

ggplot()+
  geom_sf(data=planning_districts,fill='grey90',color='transparent')+
  geom_sf(data=HDFinal,aes(colour = q5(sale_class)),size=0.5)+
  geom_sf(data=redevelopment_areas,fill='transparent',color='black')+
  scale_color_manual(values = palette5,
                    name = "Sales Price (USD)",
                    na.value = 'grey80',
                    labels = c('$0-$250k', '$250k-$500k', '$500k-$750k', '$750k-$1m', '$1m+'))+
  mapTheme()

# “Something Engaging”: Interactive Map of Homeowner-Occupied Rate

Provided below is a map that depicts where homeowners occupy the houses they reside in, as distinct from places where people rent. If you hover over a tract, you’ll see the exact rate of home ownership in that area. Other tools are available at the top right of the map.

ggplotly(
  ggplot(acsTractsPHL.2021)+
  geom_sf(aes(fill = HHOccupiedRate))+
  scale_fill_gradient(low = "black", high = "yellow", name = "Rate of Home Ownership")
)

7 Training and Test Sets

We have decided on a filtered list of key predictor variables to use in predicting home values: shootings within 1/2 mile, interior condition, total livable area, planning district name, school test performance, redevelopment area status, tree count, median household income, exterior condition, number of fireplaces, white homeownership rate, assisted income rate, restaurants per square mile, and number of bathrooms.

inTrain <- createDataPartition(
              y = paste(HDFinal$DIST_NAME), 
              p = .60, list = FALSE)
Philly.training <- HDFinal[inTrain,] 
Philly.test <- HDFinal[-inTrain,]  
 
reg.training <- 
  lm(sale_price ~ ., data = as.data.frame(Philly.training) %>% 
                             dplyr::select(sale_price, shooting_halfmile, interior_condition, total_livable_area, 
                                           DIST_NAME, mean_profadv,
                                           DevelopmentArea, tree_count, medHHInc, 
                                           exterior_condition, fireplaces, WhiteHOrate, AssistedIncRate,
                                           rest_per_sqmi, number_of_bathrooms))
summary(reg.training)
## 
## Call:
## lm(formula = sale_price ~ ., data = as.data.frame(Philly.training) %>% 
##     dplyr::select(sale_price, shooting_halfmile, interior_condition, 
##         total_livable_area, DIST_NAME, mean_profadv, DevelopmentArea, 
##         tree_count, medHHInc, exterior_condition, fireplaces, 
##         WhiteHOrate, AssistedIncRate, rest_per_sqmi, number_of_bathrooms))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1646927   -54701     -716    50017  4217280 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    7.852e+04  2.021e+04   3.886 0.000102 ***
## shooting_halfmile             -1.088e+03  1.223e+02  -8.896  < 2e-16 ***
## interior_condition            -3.034e+04  1.883e+03 -16.112  < 2e-16 ***
## total_livable_area             1.924e+02  2.642e+00  72.837  < 2e-16 ***
## DIST_NAMECentral Northeast    -1.828e+05  8.302e+03 -22.020  < 2e-16 ***
## DIST_NAMELower Far Northeast  -1.976e+05  8.180e+03 -24.159  < 2e-16 ***
## DIST_NAMELower North          -1.671e+05  8.025e+03 -20.826  < 2e-16 ***
## DIST_NAMELower Northeast      -1.821e+05  8.196e+03 -22.213  < 2e-16 ***
## DIST_NAMELower Northwest      -2.247e+05  7.022e+03 -31.999  < 2e-16 ***
## DIST_NAMELower South          -2.091e+05  2.479e+04  -8.437  < 2e-16 ***
## DIST_NAMELower Southwest      -1.986e+05  1.035e+04 -19.189  < 2e-16 ***
## DIST_NAMENorth                -1.995e+05  8.321e+03 -23.978  < 2e-16 ***
## DIST_NAMENorth Delaware       -1.961e+05  7.461e+03 -26.288  < 2e-16 ***
## DIST_NAMERiver Wards          -2.086e+05  6.805e+03 -30.653  < 2e-16 ***
## DIST_NAMESouth                -1.539e+05  6.031e+03 -25.522  < 2e-16 ***
## DIST_NAMEUniversity Southwest -1.719e+05  9.871e+03 -17.415  < 2e-16 ***
## DIST_NAMEUpper Far Northeast  -2.174e+05  9.030e+03 -24.078  < 2e-16 ***
## DIST_NAMEUpper North          -1.917e+05  8.390e+03 -22.850  < 2e-16 ***
## DIST_NAMEUpper Northwest      -2.051e+05  8.470e+03 -24.213  < 2e-16 ***
## DIST_NAMEWest                 -1.761e+05  8.663e+03 -20.330  < 2e-16 ***
## DIST_NAMEWest Park            -2.003e+05  1.152e+04 -17.393  < 2e-16 ***
## mean_profadv                   2.148e+03  1.042e+02  20.613  < 2e-16 ***
## DevelopmentArea                5.295e+03  3.969e+03   1.334 0.182218    
## tree_count                     3.629e+01  6.430e+00   5.643 1.70e-08 ***
## medHHInc                       3.661e-01  8.327e-02   4.397 1.10e-05 ***
## exterior_condition             1.644e+04  3.399e+03   4.836 1.34e-06 ***
## fireplaces                     7.413e+04  4.498e+03  16.480  < 2e-16 ***
## WhiteHOrate                    6.230e+04  9.989e+03   6.237 4.60e-10 ***
## AssistedIncRate               -4.526e+04  1.549e+04  -2.922 0.003486 ** 
## rest_per_sqmi                  7.365e+00  7.966e-01   9.245  < 2e-16 ***
## number_of_bathrooms            4.696e+04  2.862e+03  16.405  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 141200 on 14244 degrees of freedom
## Multiple R-squared:  0.6768, Adjusted R-squared:  0.6761 
## F-statistic: 994.2 on 30 and 14244 DF,  p-value: < 2.2e-16

8 Regression Training

In this step we train the model to predict housing prices. We have provided the data points described above, each with its own set of values correlated with sale price. The model learns that relationship and produces its own predictions when given new data.

The table below shows how far the predictions fall from the actual values they are compared against, known as the absolute error. Because we are working with high-value commodities (i.e., housing), the dollar amount the model is off by can be quite large, so it is equally important to look at the percentage error to see the relative amount it is off by.
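As a toy illustration of the two metrics (made-up prices, with the percentage error computed relative to the prediction, as in our code):

```r
#Toy example with made-up prices, not model output
pred   <- c(100000, 200000, 400000)
actual <- c(110000, 180000, 400000)

abs_error <- abs(pred - actual)       #absolute error in dollars: 10000, 20000, 0
MAE  <- mean(abs_error)               #mean absolute error: 10000
MAPE <- mean(abs_error / pred) * 100  #mean absolute percentage error: ~6.67%
```

Here a $10,000 average miss sounds large, but at these price levels it is under 7 percent of the predicted value, which is why we report both numbers.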

Philly.test <-
  Philly.test %>%
  mutate(Regression = "Baseline Regression",
    sale_price.Predict = predict(reg.training, Philly.test),
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict)%>%
  filter(sale_price < 5000000)

Philly.test %>% 
  st_drop_geometry() %>%
  summarise(MAE = mean(sale_price.AbsError),
            MAPE = abs(mean(sale_price.APE)*100)) %>%
  kbl(col.name=c('Mean Absolute Error','Mean Absolute Percentage Error')) %>%
  kable_classic()
Mean Absolute Error   Mean Absolute Percentage Error
           75709.16                         35.79384

9 Nearest Neighbor

Nearest neighbor analysis is our way of characterizing the distribution of points in a given space, whether by distance or by a specific value. For our analysis, nearest neighbor values can additionally help predict sale prices, since neighboring homes tend to share attributes with one another and can therefore fall in a similar price range.

coords <- st_coordinates(HDFinal) 

neighborList <- knn2nb(knearneigh(coords, 5))
## Warning in knearneigh(coords, 5): knearneigh: identical points found
## Warning in knearneigh(coords, 5): knearneigh: kd_tree not available for
## identical points
spatialWeights <- nb2listw(neighborList, style="W")

HDFinal$lagPrice <- lag.listw(spatialWeights, HDFinal$sale_price)
coords.test <-  st_coordinates(Philly.test) 

neighborList.test <- knn2nb(knearneigh(coords.test, 5))
## Warning in knearneigh(coords.test, 5): knearneigh: identical points found
## Warning in knearneigh(coords.test, 5): knearneigh: kd_tree not available for
## identical points
spatialWeights.test <- nb2listw(neighborList.test, style="W")
 
Philly.test %>% 
  mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)) %>%
  ggplot()+
  geom_point(aes(x =lagPriceError, y =sale_price.Error))

# Moran's I

Moran’s I is a statistic that evaluates the tendency of data points with similar values to occur near one another. Like the nearest neighbor analysis, it statistically tests whether houses that are spatially close to each other have similar sale prices. This measurement helps us identify high- and low-value clusters (i.e. different neighborhoods and their perceived value).

In the graph below, the thin gray distribution shows the Moran’s I values obtained from 999 random permutations of the model’s errors, and the orange line marks the observed Moran’s I. The observed value of roughly 0.2 lies well outside the permuted distribution, which is statistically significant and suggests that the errors are spatially correlated.
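The permutation test that `moran.mc` runs below can be sketched in a few lines. This is a toy Python illustration with synthetic one-dimensional data (not the project data or the spdep implementation): compute Moran’s I for the observed arrangement, then recompute it for shuffled arrangements to see how extreme the observed value is.

```python
import numpy as np

def morans_i(x, w):
    """Moran's I for values x under spatial weight matrix w:
    (n / sum(w)) * sum_ij w_ij * z_i * z_j / sum_i z_i^2."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()
    num = (w * np.outer(z, z)).sum()
    return len(x) / w.sum() * num / (z ** 2).sum()

rng = np.random.default_rng(1)
# Two value clusters along a strip; weights connect adjacent cells.
x = np.array([10, 11, 9, 10, 1, 2, 1, 2], dtype=float)
n = len(x)
w = np.zeros((n, n))
for i in range(n - 1):
    w[i, i + 1] = w[i + 1, i] = 1.0

observed = morans_i(x, w)
# Permutation test: shuffle the values, recompute I, compare.
perms = np.array([morans_i(rng.permutation(x), w) for _ in range(999)])
p = (np.sum(perms >= observed) + 1) / (999 + 1)
print(round(observed, 2), p < 0.05)  # → 0.67 True
```

The clustered arrangement produces a large positive I that almost no random shuffle matches, which is the same logic behind the histogram-plus-orange-line figure produced by the R code.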

moranTest <- moran.mc(Philly.test$sale_price.Error, 
                      spatialWeights.test, nsim = 999)

ggplot(data.frame(res = moranTest$res[1:999]), aes(res)) +
  geom_histogram(binwidth = 0.01) +
  geom_vline(aes(xintercept = moranTest$statistic), colour = "#FA7800",size=1) +
  scale_x_continuous(limits = c(-1, 1)) +
  labs(title="Observed and permuted Moran's I",
       subtitle= "Observed Moran's I in orange",
       x="Moran's I",
       y="Count") +
  plotTheme()
## Warning: Removed 2 rows containing missing values (`geom_bar()`).

9.2 Table of Average Predictions

The table below compares the average price the model predicted with the actual average price in each planning district. There is a reasonably strong level of accuracy between the predicted and actual averages. What needs to be considered is that these are averages only: the predicted price of any individual house within a district can vary widely.

Philly.test %>%
as.data.frame() %>%
  group_by(DIST_NAME) %>%
    summarize(meanPrediction = mean(sale_price.Predict),
              meanPrice = mean(sale_price)) %>%
      kable() %>% 
  kable_styling()
DIST_NAME meanPrediction meanPrice
Central 609764.50 579944.62
Central Northeast 287054.37 287035.44
Lower Far Northeast 286284.26 289399.58
Lower North 230382.84 229935.83
Lower Northeast 174741.18 173679.61
Lower Northwest 367666.82 370908.29
Lower South 457259.83 507786.96
Lower Southwest 130254.43 127275.47
North 93134.25 94969.27
North Delaware 211661.65 211547.08
River Wards 248720.35 257434.39
South 317909.04 319461.05
University Southwest 243080.49 246246.59
Upper Far Northeast 360528.93 353300.65
Upper North 179616.76 175380.92
Upper Northwest 335241.08 358161.42
West 160516.95 152537.70
West Park 234824.98 222589.28

10 Regression Neighborhood Effect

Here we compare how accounting for neighborhood spatial autocorrelation changes the sale price predictions relative to a model without it. As the graphs above show, there is a statistically significant relationship between a house’s sale price and the neighborhood it is classified in. We now fit a new model and compare its accuracy with and without the neighborhood effect.
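Before the real model, a minimal sketch of why a neighborhood indicator can help a hedonic regression. This uses synthetic data in Python (hypothetical district premiums, not the Philadelphia dataset): when districts carry a price premium that the other predictors cannot see, adding district fixed effects reduces the error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
area = rng.uniform(800, 2500, n)                  # living area (sq ft)
district = rng.integers(0, 4, n)                  # hypothetical district label
district_premium = np.array([0, 40_000, 90_000, 150_000])
price = 100 * area + district_premium[district] + rng.normal(0, 20_000, n)

def fit_mae(X, y):
    """OLS via least squares; return in-sample mean absolute error."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.abs(X1 @ beta - y).mean()

base = fit_mae(area[:, None], price)              # area only
dummies = np.eye(4)[district][:, 1:]              # district fixed effects
nhood = fit_mae(np.column_stack([area, dummies]), price)
print(base > nhood)  # district dummies absorb the neighborhood premium → True
```

The same logic motivates including `DIST_NAME` in the `lm()` call below; whether it helps in practice depends on how much of the neighborhood signal the other predictors already capture.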

reg.nhood <- lm(sale_price ~ ., data = as.data.frame(Philly.training) %>% 
                                 dplyr::select(sale_price, shooting_halfmile, interior_condition, total_livable_area, 
                                           DIST_NAME, mean_profadv,
                                           DevelopmentArea, tree_count, medHHInc, 
                                           exterior_condition, fireplaces, WhiteHOrate, AssistedIncRate,
                                           rest_per_sqmi, number_of_bathrooms)) 


Philly.test.nhood <-
 Philly.test %>%
  mutate(Regression = "Neighborhood Effects",
         sale_price.Predict = predict(reg.nhood, Philly.test),
         sale_price.Error = sale_price.Predict- sale_price,
         sale_price.AbsError = abs(sale_price.Predict- sale_price),
         sale_price.APE = (abs(sale_price.Predict- sale_price)) / sale_price)%>%
  filter(sale_price < 5000000)
bothRegressions <- 
  rbind(
    dplyr::select(Philly.test, starts_with("sale_price"), Regression, DIST_NAME) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)),
    dplyr::select(Philly.test.nhood, starts_with("sale_price"), Regression, DIST_NAME) %>%
      mutate(lagPriceError = lag.listw(spatialWeights.test, sale_price.Error)))  
st_drop_geometry(bothRegressions) %>%
  gather(Variable, Value, -Regression, -DIST_NAME) %>%
  filter(Variable == "sale_price.AbsError" | Variable == "sale_price.APE") %>%
  group_by(Regression, Variable) %>%
    summarize(meanValue = mean(Value, na.rm = T)) %>%
    spread(Variable, meanValue) %>%
    kable()
## Warning: attributes are not identical across measure variables; they will be
## dropped
## `summarise()` has grouped output by 'Regression'. You can override using the
## `.groups` argument.
Regression sale_price.AbsError sale_price.APE
Baseline Regression 75709.16 0.3579384
Neighborhood Effects 75709.16 0.4117570

10.1 Predicted Sale Price vs the Observed Price

The graph below shows the overall accuracy of the two models against the actual prices in the test set. Although the neighborhood effect model was expected to improve accuracy, the summary statistics and the graphed predictions show minimal change. One potential reason is that the baseline model already captured some form of “neighborhood effect” through its existing predictors, so the added district information was largely reinforcement and could contribute only limited additional predictive power.

bothRegressions %>%
  dplyr::select(sale_price.Predict, sale_price, Regression) %>%
    ggplot(aes(sale_price, sale_price.Predict)) +
  geom_point() +
  stat_smooth(aes(sale_price, sale_price), 
             method = "lm", se = FALSE, size = 1, colour="#FA7800") + 
  stat_smooth(aes(sale_price.Predict, sale_price), 
              method = "lm", se = FALSE, size = 1, colour="#25CB10") +
  facet_wrap(~Regression) +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Orange line represents a perfect prediction; Green line represents prediction") +
  plotTheme()
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

10.2 Spatial Pattern of Both Regressions

We continue to see a correlation between neighborhood category and a home’s sale price. The two maps below depict this spatial autocorrelation and show the difference the neighborhood effect makes in the predictions. Mapped by district, they show the baseline model becoming less reliable in the north central area, specifically where housing values are lowest, as seen in prior figures.

st_drop_geometry(bothRegressions) %>%
  group_by(Regression, DIST_NAME) %>%
  summarize(mean.MAPE = mean(sale_price.APE, na.rm = T)) %>%
  ungroup() %>% 
  left_join(planning_districts) %>%
    st_sf() %>%
    ggplot() + 
      geom_sf(aes(fill = mean.MAPE)) +
      geom_sf(data = bothRegressions, colour = "black", size = .5) +
      facet_wrap(~Regression) +
      scale_fill_gradient(low = palette5[1], high = palette5[5],
                          name = "MAPE") +
      labs(title = "Mean test set MAPE by neighborhood") +
      mapTheme()
## `summarise()` has grouped output by 'Regression'. You can override using the
## `.groups` argument.
## Joining with `by = join_by(DIST_NAME)`

11 Income and Race

These two maps show the significant income gap between races in Philadelphia. There is an evident parallel between higher-income areas and smaller minority populations. Many historical, political, and socioeconomic forces created and continue to define these boundaries, and that is exactly why we did not recommend incorporating race as a factor in the models above.

Adding race to a model creates a cascading effect of discrimination, feeding future predictions that treat race as a predetermined factor that affects house values. Race correlates with housing values only because historical systemic racism and discriminatory policies created parallels between racial and economic disparity.

grid.arrange(ncol = 2,
  ggplot() + 
    geom_sf(data=planning_districts,fill='beige',color='transparent')+
    geom_sf(data = na.omit(acsTractsPHL.2021), aes(fill = WhiteHOrate)) +
  scale_fill_gradient(low = "black", high = "white", name = "Percentage of White Homeowners")+
    labs(title = "Race Context") +
    mapTheme() + theme(legend.position="bottom"), 
  ggplot() + 
    geom_sf(data=planning_districts,fill='beige',color='transparent')+
    geom_sf(data = na.omit(acsTractsPHL.2021), aes(fill = medHHInc)) +
 scale_fill_gradient(low = "black", high = "green", name = "Median Household Income",
                    na.value = 'grey80',
                    labels = c('$0', '$50k', '$100k', '$150k', '$200k'))+
    labs(title = "Income Context") +
    mapTheme() + theme(legend.position="bottom"))